Batch Ensembl CDS, stabilize BioMart downloads, and improve PAE handling#80
Batch Ensembl CDS, stabilize BioMart downloads, and improve PAE handling#80
Conversation
…g and optimized processing
…necessary wrapper
There was a problem hiding this comment.
Pull request overview
This pull request enhances the robustness and monitoring of dataset building operations in scripts/datasets/seq_for_mut_prob.py. It introduces fallback mechanisms for BioMart metadata downloads and adds progress tracking for Ensembl CDS retrieval, improving reliability when external services are unavailable or slow.
Changes:
- Refactored
download_biomart_metadatato include retry logic, fallback from archive to latest Ensembl server, and Python-based downloader when wget is unavailable - Added progress monitoring with tqdm for Ensembl CDS sequence retrieval
- Simplified multiprocessing by removing unnecessary wrapper function
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…andling and logging
…andling and fallback mechanism
…s before fallback
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
scripts/datasets/seq_for_mut_prob.py:956
- This branch is effectively unreachable and
sys.exit()is risky inside a library-style helper (and especially problematic under multiprocessing).r.raise_for_status()will raise for 4xx/5xx, sostatus = "ERROR"; sys.exit()won’t run; if the logic changes later it could unexpectedly terminate the whole process. Consider removing this block and handling non-OK responses via exceptions/retries and returningnp.nanon terminal failure.
if not r.ok:
r.raise_for_status()
status = "ERROR"
sys.exit()
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Comments suppressed due to low confidence (2)
scripts/datasets/seq_for_mut_prob.py:1456
- The main block (lines 1446-1456) contains hardcoded local paths for testing, which should not be in production code. While the PR description mentions removing hardcoded paths from build_datasets.py and adding a CLI-only guard, this file still has test code. Consider removing this test code or replacing it with a similar guard that directs users to use the CLI.
if __name__ == "__main__":
raise SystemExit(
"This module is intended to be used via the CLI: `oncodrive3d build-datasets`."
)
scripts/datasets/seq_for_mut_prob.py:1237
- The function signature changed to remove ens_canonical_transcripts_lst, custom_mane_metadata_path, and mane_version parameters, and add mane_only parameter. The docstring should be updated to reflect these changes and explain the new behavior, particularly around how mane_only affects the filtering of non-MANE sequences.
uniprot_to_gene_dict,
mane_mapping,
mane_mapping_not_af,
mane_only=False,
num_cores=1):
"""
Retrieve DNA sequence and tri-nucleotide context
for each structure in the initialized dataframe
prioritizing MANE associated structures and metadata.
Reference_info labels:
1 : Transcript ID, exons coord, seq DNA obtained from Proteins API
0 : Transcript ID retrieved from MANE and seq DNA from Ensembl
-1 : Not available transcripts, seq DNA retrieved from Backtranseq API
"""
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
… metadata retrieval
…recomputed symbol map and only add CGC + length on top
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…are_samplesheet.py
…wait time in get_ref_dna_from_ensembl_batch;
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
scripts/datasets/get_pae.py:63
- If the response is HTTP 200 but the body doesn’t match the expected JSON pattern (
content.endswith(b'}]')),statusremainsINIT, which causes the loop to immediately retry without the 30s backoff. Setstatus = "ERROR"(or otherwise sleep) when content validation fails to avoid tight retry loops and rate-limiting.
content = response.content
if content.endswith(b'}]') and not content.endswith(b'</Error>'):
with open(file_path, 'wb') as output_file:
output_file.write(content)
status = "FINISHED"
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -224,12 +224,17 @@ def get_pdb_seqres_records(lst_res): | |||
| def add_refseq_record_to_pdb(path_structure): | |||
| """ | |||
| Add the SEQREF records to the pdb file. | |||
There was a problem hiding this comment.
The docstring says “Add the SEQREF records”, but the code is inserting SEQRES records. Update the docstring to match the PDB record type to avoid confusion.
| Add the SEQREF records to the pdb file. | |
| Add the SEQRES records to the pdb file. |
| samplesheet = attach_metadata(samplesheet, metadata_map) | ||
| samplesheet.to_csv(paths.samplesheet_path, index=False) | ||
|
|
||
| metadata_for_outputs = metadata_map or symbol_map |
There was a problem hiding this comment.
metadata_for_outputs = metadata_map or symbol_map will raise ValueError: The truth value of a DataFrame is ambiguous when metadata_map is a DataFrame. Use an explicit is not None check (e.g., choose metadata_map if it’s not None, otherwise symbol_map).
| metadata_for_outputs = metadata_map or symbol_map | |
| metadata_for_outputs = metadata_map if metadata_map is not None else symbol_map |
…QRES already exists
…ce and reliability
…ing for improved reliability
Summary
This PR makes the build pipeline faster and more reliable under current Ensembl/BioMart/AF DB constraints. Ensembl CDS is now batched, BioMart downloads have fallbacks and clearer logs, and PAE handling is robust to missing AF DB versions with a new
--custom_pae_dir. MANE builds are explicitly pinned to AF DB v4 for consistency, and several edge‑case crashes plus doc/debug ergonomics were cleaned up. Default AF DB version for non‑MANE builds is now v6 (latest at the time of this PR). Backtranseq batching/retry/timeouts were hardened to avoid multi‑hour hangs.Issue 1: Ensembl CDS retrieval speed
Issue 2: BioMart metadata download instability
--no-hsts, and SSL verify for HTTPS fallback.download_single_file().Issue 3: PAE availability and custom input
--custom_pae_dir; skip download when 10 consecutive 404/410s detected.pae/; probe first 10 IDs sequentially, then parallel download.Issue 4: MANE + AF version consistency
af_version=4when--mane/--mane_onlyis used.merge_af_fragmentsnow receivesaf_version(fixes v4 hardcode when default is v6).--af_versionis now6.Issue 5: Backtranseq robustness
Issue 6: Sequence/PDB hygiene & mapping
Issue 7: Developer UX & tooling
prepare_samplesheet.pyso it runs directly from its folder (adds repo root tosys.path, avoidingModuleNotFoundError: scripts).update_samplesheet_and_structures.pyalways adds symbol to final bundles;samplesheet.csvis kept clean (no symbol/CGC/length).--include-metadatanow only adds CGC/length.build_datasets.pyand kept CLI‑only guard.